Descriptive Statistics of the Genome: Phylogenetic Classification of Viruses

Authors

  • Troy Hernandez
  • Jie Yang
Abstract

The typical process for classifying and submitting a newly sequenced virus to the NCBI database involves two steps. First, a BLAST search is performed to determine likely family candidates. Next, the candidate families are checked for similar species with the pairwise sequence alignment tool. The submitter's judgment is then used to determine the most likely species classification. The aim of this article is to show that this process can be automated into a fast, accurate, one-step process using the proposed alignment-free method and properly implemented machine learning techniques. We present a new family of alignment-free vectorizations of the genome, the generalized vector, which maintains the speed of existing alignment-free methods while outperforming all available methods. This new alignment-free vectorization uses the frequency of genomic words (k-mers), as is done in the composition vector, and incorporates descriptive statistics of those k-mers' positional information, as inspired by the natural vector. We analyze five different characterizations of genome similarity using k-nearest neighbor classification and evaluate these on two collections of viruses totaling over 10,000 viruses. We show that our proposed method performs better than, or as well as, other methods at every level of the phylogenetic hierarchy. The data and R code are available upon request.
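The abstract does not give the exact definition of the generalized vector, so the following Python sketch is only an illustration of the general idea: combine k-mer counts with descriptive statistics of where each k-mer occurs, then classify by nearest neighbors. The choice of statistics (count, mean start position, population variance of start positions), the Euclidean distance, and the function names `kmer_position_vector` and `knn_predict` are all assumptions made for this example, not the authors' method or their R code.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"


def kmer_position_vector(seq, k=2):
    """Hypothetical feature vector: for every k-mer (in lexicographic order),
    its count, mean start position, and population variance of start positions.
    These three statistics are an assumption chosen for illustration."""
    seq = seq.upper()
    positions = {"".join(p): [] for p in product(ALPHABET, repeat=k)}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in positions:  # skip words containing ambiguous bases such as N
            positions[word].append(i + 1)  # 1-based start position
    vector = []
    for word in sorted(positions):
        pos = positions[word]
        n = len(pos)
        if n == 0:
            # Absent k-mers contribute zeros (a convention assumed here)
            vector.extend([0.0, 0.0, 0.0])
            continue
        mean = sum(pos) / n
        var = sum((p - mean) ** 2 for p in pos) / n
        vector.extend([float(n), mean, var])
    return vector


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(query, labeled, n_neighbors=1, k=2):
    """Majority vote among the n_neighbors closest labeled sequences.
    `labeled` is a list of (sequence, label) pairs."""
    q = kmer_position_vector(query, k)
    ranked = sorted(
        labeled,
        key=lambda item: euclidean(q, kmer_position_vector(item[0], k)),
    )
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]
```

With toy inputs such as `knn_predict("AAAAAAAC", [("AAAAAAAA", "famA"), ("CGCGCGCG", "famB")])`, the query is assigned the label of the closest labeled sequence. Real genomes differ greatly in length, so a practical version would likely need to normalize counts and positions before comparing vectors; that step is omitted here.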

Similar resources

Evolution of viruses and cells: do we need a fourth domain of life to explain the origin of eukaryotes?

The recent discovery of diverse very large viruses, such as the mimivirus, has fostered a profusion of hypotheses positing that these viruses define a new domain of life together with the three cellular ones (Archaea, Bacteria and Eucarya). It has also been speculated that they have played a key role in the origin of eukaryotes as donors of important genes or even as the structures at the origi...

A Novel Genetic classification of SARS coronavirus-2 following whole nucleic acid and protein alignment of the isolated viruses

Background and aims: At the end of 2019, the human population encountered a novel virus, SARS-CoV-2, which causes the disease COVID-19. Here we focused on the genome and protein mutations and subsequently suggested a new classification of SARS-CoV-2. Materials and Methods: Our study showed that some extra positions in the virus genome play a key role in the SARS-C...

Molecular Characterization and Phylogenetic Study of Newcastle Disease Viruses Isolated in Iran, 2014–2015

Newcastle disease (ND) is a highly contagious disease that affects many species of birds and causes significant economic losses to the poultry industry worldwide. Strains of Newcastle disease virus (NDV) vary in pathogenicity and virulence. Samples were collected from commercial chicken farms in Iran during 2014–2015. The ND viruses (NDV) were characterized by partial sequences...

S7 gene Characterization of Bluetongue Viruses in Iran

Bluetongue is an infectious disease that primarily affects sheep. Because of the serious socioeconomic consequences of its outbreaks for international trade, it has been included in the OIE notifiable diseases (list A). During 2007-8, a total of 130 blood samples were gathered from sheep suspected of bluetongue disease in seropositive regions including Khuzestan, Kurdistan, Fars, Ilam and Qum prov...

Characterization of Pigeon Paramyxovirus Type 1 Viruses (PPMV-1) Isolated from Iran

Newcastle disease (ND) is one of the contagious viral diseases in avian species. Recently, a limited number of ND outbreaks in pigeons caused by pigeon paramyxovirus serotype-1 (PPMV-1) have been reported from Iran, and phylogenetic studies have been conducted on partial sequences of the NDV fusion (F) gene. In the present study, ten PPMV-1, named Pigeon_paramyxovirus1_isolate_pigeon/Iran/UT_EG...

Journal title:
  • Journal of computational biology : a journal of computational molecular cell biology

Volume 23, Issue 10

Pages -

Publication date: 2016